Python for Data Science Tutorial

By:
Heba El-Shimy
PhD Scholar and Teaching Assistant
Heriot-Watt University, Dubai Campus

Based on content from Python Lecture 1
Heriot-Watt University, Edinburgh Campus
Author: Daniel Kienitz

Requirements Installation

1. Install the latest Python (3.7.4 as of the time of this tutorial):

  • Go to Python.org downloads page https://www.python.org/downloads/
  • Select your platform (Windows/MacOS/Linux)
  • Download and follow the installation steps in the wizard

2. Install Anaconda

3. Install some additional libraries needed for this tutorial

  • Open the Command Line Prompt (Windows) or the Terminal (MacOS/Linux)
  • Update Conda.

   conda update conda
  • Create a virtual environment (best practice). Replace "yourenvname" with any name you choose, and press [y] when prompted to proceed.

   conda create -n yourenvname anaconda
  • Activate your virtual environment.

   conda activate yourenvname

Note: use conda deactivate to leave your virtual environment.

  • Install OpenCV Library (for image manipulation) and Seaborn Library (for plotting and visualization) by writing the following commands:

   conda install -c anaconda -n yourenvname seaborn
   conda install -c menpo -n yourenvname opencv3
  • Start a jupyter notebook within your environment.

   jupyter notebook
  • A Jupyter notebook server will start and open in a new browser tab. If it does not, Jupyter prints a link containing a security token in your CMD or Terminal; copy and paste that link into your browser.

  • In the Jupyter webpage, click New --> Python 3 under notebooks to create a new notebook.



Getting started with Jupyter Notebooks

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

Cells expect code as input by default, but you can change the cell Format to Markdown from the toolbar above to write pieces of text like this one.

To edit a cell's content, activate that cell by clicking on it, or using your keyboard's ↑ or ↓ arrows to move through the cells. Once you reach the cell you need to edit, hit ↵.

To run a cell's content (code, or Markdown for text), hit "Shift + ↵" or press "Run" in the toolbar above.



Importing Libraries

In [105]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import statsmodels.api as sm
import re


Working with Pandas

In [4]:
# Reading a CSV file as dataframe and saving it into a variable
# Make the column separator the semicolon character
mtcars = pd.read_csv('mtcars.csv', sep=';')

# Print the dataframe (or part of it if it's too long/wide)
mtcars
Out[4]:
Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21 6 160 110 3,9 2,62 16,46 0 1 4 4
1 Mazda RX4 Wag 21 6 160 110 3,9 2,875 17,02 0 1 4 4
2 Datsun 710 22,8 4 108 93 3,85 2,32 18,61 1 1 4 1
3 Hornet 4 Drive 21,4 6 258 110 3,08 3,215 19,44 1 0 3 1
4 Hornet Sportabout 18,7 8 360 175 3,15 3,44 17,02 0 0 3 2
5 Valiant 18,1 6 225 105 2,76 3,46 20,22 1 0 3 1
6 Duster 360 14,3 8 360 245 3,21 3,57 15,84 0 0 3 4
7 Merc 240D 24,4 4 146,7 62 3,69 3,19 20 1 0 4 2
8 Merc 230 22,8 4 140,8 95 3,92 3,15 22,9 1 0 4 2
9 Merc 280 19,2 6 167,6 123 3,92 3,44 18,3 1 0 4 4
10 Merc 280C 17,8 6 167,6 123 3,92 3,44 18,9 1 0 4 4
11 Merc 450SE 16,4 8 275,8 180 3,07 4,07 17,4 0 0 3 3
12 Merc 450SL 17,3 8 275,8 180 3,07 3,73 17,6 0 0 3 3
13 Merc 450SLC 15,2 8 275,8 180 3,07 3,78 18 0 0 3 3
14 Cadillac Fleetwood 10,4 8 472 205 2,93 5,25 17,98 0 0 3 4
15 Lincoln Continental 10,4 8 460 215 3 5,424 17,82 0 0 3 4
16 Chrysler Imperial 14,7 8 440 230 3,23 5,345 17,42 0 0 3 4
17 Fiat 128 32,4 4 78,7 66 4,08 2,2 19,47 1 1 4 1
18 Honda Civic 30,4 4 75,7 52 4,93 1,615 18,52 1 1 4 2
19 Toyota Corolla 33,9 4 71,1 65 4,22 1,835 19,9 1 1 4 1
20 Toyota Corona 21,5 4 120,1 97 3,7 2,465 20,01 1 0 3 1
21 Dodge Challenger 15,5 8 318 150 2,76 3,52 16,87 0 0 3 2
22 AMC Javelin 15,2 8 304 150 3,15 3,435 17,3 0 0 3 2
23 Camaro Z28 13,3 8 350 245 3,73 3,84 15,41 0 0 3 4
24 Pontiac Firebird 19,2 8 400 175 3,08 3,845 17,05 0 0 3 2
25 Fiat X1-9 27,3 4 79 66 4,08 1,935 18,9 1 1 4 1
26 Porsche 914-2 26 4 120,3 91 4,43 2,14 16,7 0 1 5 2
27 Lotus Europa 30,4 4 95,1 113 3,77 1,513 16,9 1 1 5 2
28 Ford Pantera L 15,8 8 351 264 4,22 3,17 14,5 0 1 5 4
29 Ferrari Dino 19,7 6 145 175 3,62 2,77 15,5 0 1 5 6
30 Maserati Bora 15 8 301 335 3,54 3,57 14,6 0 1 5 8
31 Volvo 142E 21,4 4 121 109 4,11 2,78 18,6 1 1 4 2
In [5]:
# Read the CSV file with the semicolon as the column separator and the comma as the decimal character
# Treat the string 'None' as a missing value (NaN) and use the first column as the dataframe index
mtcars = pd.read_csv('mtcars.csv', sep=';', decimal=',', na_values='None', index_col=0)
mtcars
Out[5]:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
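
To see what the sep and decimal arguments change without needing mtcars.csv, the following minimal sketch parses a two-row excerpt from an in-memory string. The excerpt, its empty first header field and the variable names are assumptions made for illustration; only pd.read_csv and its sep, decimal and index_col parameters are taken from the cells above.

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking the file layout: ';' separates columns, ',' marks decimals
csv_text = ";mpg;cyl;wt\nMazda RX4;21;6;2,62\nDatsun 710;22,8;4;2,32\n"

# Without decimal=',' values such as "22,8" cannot be parsed as numbers and stay text
raw = pd.read_csv(io.StringIO(csv_text), sep=';')

# With decimal=',' they become floats; index_col=0 turns the car names into the index
parsed = pd.read_csv(io.StringIO(csv_text), sep=';', decimal=',', index_col=0)

print(raw.dtypes)
print(parsed.dtypes)
```

Comparing the two dtype listings shows why the first read above printed values like 22,8: they were still text, so numeric operations on them would not work as intended.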
In [6]:
# Get the dataframe dimensions (rows X columns)
mtcars.shape
Out[6]:
(32, 11)
In [7]:
# Get the first 5 rows in the dataframe
mtcars.head(5)
Out[7]:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
In [8]:
# Get the last 5 rows in the dataframe
mtcars.tail(5)
Out[8]:
mpg cyl disp hp drat wt qsec vs am gear carb
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
In [9]:
# Using Pandas built-in methods to perform operations on the dataframe
# .index is an attribute (not a method) that returns a Pandas Index object holding the row labels
# .tolist() converts that Index to a Python list
mtcars.index.tolist()
Out[9]:
['Mazda RX4',
 'Mazda RX4 Wag',
 'Datsun 710',
 'Hornet 4 Drive',
 'Hornet Sportabout',
 'Valiant',
 'Duster 360',
 'Merc 240D',
 'Merc 230',
 'Merc 280',
 'Merc 280C',
 'Merc 450SE',
 'Merc 450SL',
 'Merc 450SLC',
 'Cadillac Fleetwood',
 'Lincoln Continental',
 'Chrysler Imperial',
 'Fiat 128',
 'Honda Civic',
 'Toyota Corolla',
 'Toyota Corona',
 'Dodge Challenger',
 'AMC Javelin',
 'Camaro Z28',
 'Pontiac Firebird',
 'Fiat X1-9',
 'Porsche 914-2',
 'Lotus Europa',
 'Ford Pantera L',
 'Ferrari Dino',
 'Maserati Bora',
 'Volvo 142E']
In [10]:
# Type of each column
# Object type refers to strings and can support string operations
mtcars.dtypes
Out[10]:
mpg     float64
cyl       int64
disp    float64
hp        int64
drat    float64
wt      float64
qsec    float64
vs        int64
am        int64
gear      int64
carb      int64
dtype: object
In [25]:
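# Print all column names (this cell was executed after the 'lp100km' column was added
# further down, which is why that column appears in the output)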
mtcars.columns
Out[25]:
Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear',
       'carb', 'lp100km'],
      dtype='object')
In [14]:
# Selecting a column in the dataframe (using column name as string)
mtcars['disp']
Out[14]:
Mazda RX4              160.0
Mazda RX4 Wag          160.0
Datsun 710             108.0
Hornet 4 Drive         258.0
Hornet Sportabout      360.0
Valiant                225.0
Duster 360             360.0
Merc 240D              146.7
Merc 230               140.8
Merc 280               167.6
Merc 280C              167.6
Merc 450SE             275.8
Merc 450SL             275.8
Merc 450SLC            275.8
Cadillac Fleetwood     472.0
Lincoln Continental    460.0
Chrysler Imperial      440.0
Fiat 128                78.7
Honda Civic             75.7
Toyota Corolla          71.1
Toyota Corona          120.1
Dodge Challenger       318.0
AMC Javelin            304.0
Camaro Z28             350.0
Pontiac Firebird       400.0
Fiat X1-9               79.0
Porsche 914-2          120.3
Lotus Europa            95.1
Ford Pantera L         351.0
Ferrari Dino           145.0
Maserati Bora          301.0
Volvo 142E             121.0
Name: disp, dtype: float64
In [15]:
# Selecting a column in the dataframe (using column position as an integer)
mtcars.iloc[:, 2]
Out[15]:
Mazda RX4              160.0
Mazda RX4 Wag          160.0
Datsun 710             108.0
Hornet 4 Drive         258.0
Hornet Sportabout      360.0
Valiant                225.0
Duster 360             360.0
Merc 240D              146.7
Merc 230               140.8
Merc 280               167.6
Merc 280C              167.6
Merc 450SE             275.8
Merc 450SL             275.8
Merc 450SLC            275.8
Cadillac Fleetwood     472.0
Lincoln Continental    460.0
Chrysler Imperial      440.0
Fiat 128                78.7
Honda Civic             75.7
Toyota Corolla          71.1
Toyota Corona          120.1
Dodge Challenger       318.0
AMC Javelin            304.0
Camaro Z28             350.0
Pontiac Firebird       400.0
Fiat X1-9               79.0
Porsche 914-2          120.3
Lotus Europa            95.1
Ford Pantera L         351.0
Ferrari Dino           145.0
Maserati Bora          301.0
Volvo 142E             121.0
Name: disp, dtype: float64
In [18]:
# Selecting a row in the dataframe (using index name as string)
mtcars.loc['Ford Pantera L', :]
Out[18]:
mpg      15.80
cyl       8.00
disp    351.00
hp      264.00
drat      4.22
wt        3.17
qsec     14.50
vs        0.00
am        1.00
gear      5.00
carb      4.00
Name: Ford Pantera L, dtype: float64
In [19]:
# Selecting a row in the dataframe (using row position as integer)
mtcars.iloc[28, :]
Out[19]:
mpg      15.80
cyl       8.00
disp    351.00
hp      264.00
drat      4.22
wt        3.17
qsec     14.50
vs        0.00
am        1.00
gear      5.00
carb      4.00
Name: Ford Pantera L, dtype: float64
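
The selections above use .loc for labels and .iloc for integer positions. One difference is easy to miss when slicing: .loc includes both endpoints of a slice, while .iloc excludes the end position. The sketch below illustrates this on a small made-up dataframe; the three-row excerpt and the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row excerpt of the table, indexed by car name
cars = pd.DataFrame(
    {"mpg": [21.0, 22.8, 21.4], "cyl": [6, 4, 6], "hp": [110, 93, 110]},
    index=["Mazda RX4", "Datsun 710", "Hornet 4 Drive"],
)

# .loc slices by label and includes BOTH endpoints
by_label = cars.loc["Mazda RX4":"Datsun 710", "mpg":"cyl"]

# .iloc slices by position and excludes the end position, like a Python list
by_position = cars.iloc[0:2, 0:2]

print(by_label.equals(by_position))
```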
In [20]:
# Test your skills

# TODO 1: Select a single cell in a certain row and column

# TODO 2: Select multiple columns
In [11]:
# Define a function to convert miles per gallon (mpg) to liters per 100 km
# (approximating 1 US gallon as 3.78 liters and 1 mile as 1.6 km)
def lp100(mpg_val):
    return 100 * 3.78 / (1.6 * mpg_val)
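
The formula follows from the units: a car that covers mpg miles on one gallon travels about 1.6 * mpg km on about 3.78 liters, so it needs 100 * 3.78 / (1.6 * mpg) liters per 100 km. The sketch below compares those rounded constants with the exact definitions (1 US gallon = 3.785411784 L, 1 mile = 1.609344 km). It assumes the mpg values refer to US gallons; lp100_exact and the constant names are hypothetical helpers added for illustration.

```python
LITERS_PER_US_GALLON = 3.785411784  # exact by definition
KM_PER_MILE = 1.609344              # exact by definition

def lp100(mpg_val):
    # Rounded constants, as in the tutorial's function
    return 100 * 3.78 / (1.6 * mpg_val)

def lp100_exact(mpg_val):
    # Same formula with the exact conversion factors (hypothetical helper)
    return 100 * LITERS_PER_US_GALLON / (KM_PER_MILE * mpg_val)

for mpg in (21.0, 33.9):
    print(mpg, round(lp100(mpg), 3), round(lp100_exact(mpg), 3))
```

The rounded constants overestimate consumption by roughly 0.4%, which is negligible for this tutorial but worth knowing if the column is reused elsewhere.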
In [21]:
# Adding a new column to the dataframe
# Select the 'mpg' column and use the apply method
# which takes a function and applies it to each value in a Pandas Series
mtcars['lp100km'] = mtcars['mpg'].apply(func=lp100)
mtcars
Out[21]:
mpg cyl disp hp drat wt qsec vs am gear carb lp100km
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 11.250000
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 11.250000
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 10.361842
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 11.039720
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 12.633690
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 13.052486
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 16.520979
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9.682377
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10.361842
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 12.304688
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 13.272472
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 14.405488
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13.656069
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 15.542763
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 22.716346
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 22.716346
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 16.071429
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7.291667
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 7.771382
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6.969027
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 10.988372
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 15.241935
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 15.542763
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 17.763158
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 12.304688
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 8.653846
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 9.086538
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7.771382
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 14.952532
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 11.992386
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 15.750000
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 11.039720
In [23]:
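# Another way of doing the previous operation is with a lambda function,
# i.e. applying an anonymous (not previously defined) function to each item of a Series
# .copy() makes an independent dataframe, so the assignment below does not raise a SettingWithCopyWarning
mtcars2 = mtcars.iloc[:, :-1].copy()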
mtcars2['lp100km'] = mtcars2['mpg'].apply(lambda x: 100 * 3.78 / (1.6 * x))
mtcars2
Out[23]:
mpg cyl disp hp drat wt qsec vs am gear carb lp100km
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 11.250000
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 11.250000
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 10.361842
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 11.039720
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 12.633690
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 13.052486
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 16.520979
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9.682377
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10.361842
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 12.304688
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 13.272472
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 14.405488
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13.656069
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 15.542763
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 22.716346
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 22.716346
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 16.071429
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7.291667
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 7.771382
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6.969027
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 10.988372
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 15.241935
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 15.542763
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 17.763158
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 12.304688
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 8.653846
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 9.086538
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7.771382
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 14.952532
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 11.992386
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 15.750000
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 11.039720
In [36]:
# A third way to add a column is the built-in insert method,
# which places the new column at a given position (here: after the last existing column)
mtcars3 = mtcars.iloc[:, :-1].copy()
mtcars3.insert(len(mtcars3.columns), 'lp100km', mtcars3['mpg'].apply(lambda x: 100 * 3.78 / (1.6 * x)))
mtcars3
Out[36]:
mpg cyl disp hp drat wt qsec vs am gear carb lp100km
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 11.250000
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 11.250000
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 10.361842
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 11.039720
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 12.633690
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 13.052486
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 16.520979
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9.682377
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10.361842
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 12.304688
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 13.272472
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 14.405488
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13.656069
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 15.542763
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 22.716346
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 22.716346
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 16.071429
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7.291667
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 7.771382
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6.969027
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 10.988372
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 15.241935
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 15.542763
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 17.763158
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 12.304688
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 8.653846
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 9.086538
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7.771382
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 14.952532
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 11.992386
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 15.750000
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 11.039720
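
All three approaches above call a Python function once per value. Because arithmetic operators act on a whole Pandas Series at once, the same column can also be computed without apply. This is a minimal sketch on a made-up excerpt of the mpg column; the dataframe and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of the mpg column
cars = pd.DataFrame({"mpg": [21.0, 22.8, 18.7]},
                    index=["Mazda RX4", "Datsun 710", "Hornet Sportabout"])

# Element-wise version, as in the cells above
with_apply = cars["mpg"].apply(lambda x: 100 * 3.78 / (1.6 * x))

# Vectorized version: the arithmetic operators act on the whole Series at once
vectorized = 100 * 3.78 / (1.6 * cars["mpg"])

cars["lp100km"] = vectorized
print(np.allclose(with_apply, vectorized))
```

The vectorized form is usually shorter and faster on large dataframes; apply remains useful when the per-value logic cannot be written as Series arithmetic.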
In [26]:
# Creating a boolean mask
remove_bools = [bool(re.search(pattern='lp100', string=col_name)) for col_name in mtcars.columns]
In [28]:
remove_bools
Out[28]:
[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True]
In [29]:
# Another way of performing the previous operation using Pandas built-in methods
remove_bools_pd = mtcars2.columns.str.contains('lp100').tolist()
remove_bools_pd
Out[29]:
[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True]
In [31]:
# Dropping columns from a dataframe
# Use the boolean mask we created: every column whose mask value is True is dropped
# axis selects what is dropped: rows (axis=0) or columns (axis=1)
# drop returns a new dataframe, so assign the result to a variable to keep it
mtcars = mtcars.drop(labels=mtcars.columns[remove_bools].tolist(), axis=1)
mtcars
Out[31]:
mpg cyl disp hp drat wt qsec vs am gear carb lp100km
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 11.250000
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 11.250000
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 10.361842
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 11.039720
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 12.633690
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 13.052486
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 16.520979
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9.682377
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10.361842
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 12.304688
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 13.272472
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 14.405488
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13.656069
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 15.542763
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 22.716346
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 22.716346
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 16.071429
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7.291667
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 7.771382
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6.969027
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 10.988372
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 15.241935
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 15.542763
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 17.763158
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 12.304688
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 8.653846
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 9.086538
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7.771382
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 14.952532
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 11.992386
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 15.750000
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 11.039720
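
The same mask can be used in two other ways that some readers find easier to read: passing the names through the columns= keyword of drop, or keeping the columns where the mask is False. The sketch below shows both on a small made-up dataframe; the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical excerpt with the extra column to remove
cars = pd.DataFrame({"mpg": [21.0, 22.8], "cyl": [6, 4], "lp100km": [11.25, 10.361842]})

# Boolean mask: True for every column whose name contains 'lp100'
mask = cars.columns.str.contains("lp100")

# Option 1: drop by name, using columns= instead of labels= plus axis=1
dropped = cars.drop(columns=cars.columns[mask])

# Option 2: keep the columns where the mask is False (~ inverts the mask)
kept = cars.loc[:, ~mask]

print(dropped.columns.tolist())
print(kept.columns.tolist())
```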
In [38]:
# Reset index (useful in cases of subsetting the data)
mtcars4 = mtcars.reset_index(drop=False)
mtcars4
Out[38]:
index mpg cyl disp hp drat wt qsec vs am gear carb lp100km
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 11.250000
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 11.250000
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 10.361842
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 11.039720
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 12.633690
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 13.052486
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 16.520979
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9.682377
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10.361842
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 12.304688
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 13.272472
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 14.405488
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 13.656069
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 15.542763
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 22.716346
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 22.716346
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 16.071429
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7.291667
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 7.771382
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6.969027
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 10.988372
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 15.241935
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 15.542763
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 17.763158
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 12.304688
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 8.653846
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 9.086538
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7.771382
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 14.952532
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 11.992386
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 15.750000
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 11.039720
In [ ]:
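# Pipes chain operations on the dataframe: .pipe(func, arg1, arg2) calls func(df, arg1, arg2)
# (long chains are prone to error); might be useful for batch preprocessing data using a for loop
# The filter, sort and head steps below are only illustrative placeholders
mtcars.pipe(lambda df: df[df['cyl'] > 4]).pipe(pd.DataFrame.sort_values, by='mpg').pipe(pd.DataFrame.head, 3)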


Working with OpenCV

In [40]:
# Read an image (into a numpy array)
image = cv2.imread('maserati_bora.jpg')
image
Out[40]:
array([[[166, 183, 204],
        [161, 178, 199],
        [148, 162, 184],
        ...,
        [144, 163, 184],
        [149, 169, 187],
        [152, 172, 189]],

       [[161, 178, 199],
        [159, 176, 197],
        [151, 165, 187],
        ...,
        [149, 168, 189],
        [151, 171, 189],
        [155, 175, 192]],

       [[154, 171, 192],
        [155, 172, 191],
        [153, 168, 187],
        ...,
        [153, 173, 191],
        [153, 173, 191],
        [158, 178, 195]],

       ...,

       [[ 87,  87,  81],
        [ 91,  88,  83],
        [ 87,  85,  77],
        ...,
        [ 83,  88,  91],
        [ 89,  94,  97],
        [ 99, 104, 107]],

       [[ 70,  70,  64],
        [ 83,  80,  75],
        [ 88,  86,  78],
        ...,
        [ 81,  86,  89],
        [ 90,  95,  98],
        [102, 107, 110]],

       [[ 76,  76,  70],
        [ 90,  87,  82],
        [ 92,  90,  82],
        ...,
        [ 74,  79,  82],
        [ 82,  87,  90],
        [ 95, 100, 103]]], dtype=uint8)
In [45]:
# Viewing the image
plt.figure(figsize=(15, 20))
plt.imshow(image)
Out[45]:
<matplotlib.image.AxesImage at 0x23e99a6f3c8>
In [46]:
# Matplotlib expects color images in RGB order, but OpenCV loads them as BGR (Blue, Green, Red),
# so the red and blue channels appear swapped in the plot above
# Images read as numpy arrays are 3-dimensional: (height/rows of pixels, width/columns of pixels, color channels)
image.shape
Out[46]:
(1044, 1567, 3)
In [47]:
# Convert from OpenCV's BGR to RGB
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
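
For an 8-bit, 3-channel image, going from BGR to RGB amounts to reversing the order of the last (channel) axis. The sketch below shows that with plain NumPy on a tiny synthetic array, so it runs without OpenCV or an image file; the 1x2 "image" and the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical 1x2 image in BGR order: one pure-blue pixel and one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last (channel) axis turns B, G, R into R, G, B
rgb = bgr[..., ::-1]

print(bgr.shape, rgb.shape)
print(rgb[0, 0], rgb[0, 1])
```

Only the channel order changes; the height and width of the array stay the same.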
In [48]:
# Show the image
plt.figure(figsize=(15, 20))
plt.imshow(image_rgb)
Out[48]:
<matplotlib.image.AxesImage at 0x23e99ad9d48>
In [49]:
## Test your skills

# TODO 1: How many color channels in a grayscale image?

# TODO 2: How to transform a colored image into grayscale image?
In [50]:
# Resizing an image
image_resize = cv2.resize(image, (48, 48))
In [51]:
image_resize.shape
Out[51]:
(48, 48, 3)
In [54]:
# Flatten the 3D image numpy array into a row vector: 1 row and as many columns as are
# needed to hold the data (-1 lets numpy infer that number)
image_resize = image_resize.reshape((1, -1))
image_resize
Out[54]:
array([[152, 164, 190, ...,  83,  88,  94]], dtype=uint8)
In [55]:
image_resize.shape
Out[55]:
(1, 6912)
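
The column count of 6912 is simply 48 * 48 * 3. The sketch below reproduces the flattening on a synthetic array and shows two details: reshape returns a new array object rather than changing the original in place, and the operation can be undone when the original shape is known. The stand-in array and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical stand-in for a resized 48x48 color image
image = np.arange(48 * 48 * 3, dtype=np.uint16).reshape((48, 48, 3))

# -1 asks numpy to infer the column count: 48 * 48 * 3 = 6912
row = image.reshape((1, -1))

# The flattening is reversible as long as the original shape is known
restored = row.reshape((48, 48, 3))

print(image.shape, row.shape)
print(np.array_equal(image, restored))
```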
In [56]:
# Read an image (into a numpy array)
image2 = cv2.imread('s_klasse.jpg')
image2
Out[56]:
array([[[ 38, 120,  77],
        [ 26, 103,  59],
        [ 17,  97,  54],
        ...,
        [  7,  32,  18],
        [ 11,  32,  17],
        [ 18,  37,  20]],

       [[ 34, 108,  66],
        [ 11,  91,  48],
        [  3,  90,  52],
        ...,
        [  4,  29,  15],
        [  9,  26,  15],
        [ 14,  31,  18]],

       [[ 34, 108,  66],
        [  2,  87,  49],
        [  0,  87,  53],
        ...,
        [  5,  29,  17],
        [  6,  28,  16],
        [ 12,  32,  19]],

       ...,

       [[114,  75,  83],
        [ 51,  41,  53],
        [ 84, 107, 122],
        ...,
        [129, 127, 127],
        [126, 124, 123],
        [153, 150, 145]],

       [[121, 125, 114],
        [119, 113, 108],
        [100,  86,  88],
        ...,
        [121, 118, 120],
        [115, 117, 118],
        [129, 131, 131]],

       [[107, 110, 124],
        [101, 108, 117],
        [141, 148, 151],
        ...,
        [123, 125, 125],
        [107, 107, 107],
        [104, 104, 104]]], dtype=uint8)
In [57]:
# Viewing the image
plt.figure(figsize=(15, 20))
plt.imshow(image2)
Out[57]:
<matplotlib.image.AxesImage at 0x23e99b41708>
In [58]:
# Convert from OpenCV's BGR to grayscale
image2_grayscale = cv2.cvtColor(image2, cv2.COLOR_BGR2GRAY)
In [60]:
image2_grayscale.shape
Out[60]:
(683, 1024)
In [59]:
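# Without a cmap argument, matplotlib draws a single-channel image with its default colormap, not in gray
plt.figure(figsize=(15, 20))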
plt.imshow(image2_grayscale)
Out[59]:
<matplotlib.image.AxesImage at 0x23e9aa0a508>
In [65]:
plt.imshow(image2_grayscale, cmap='Blues')
Out[65]:
<matplotlib.image.AxesImage at 0x23e9aabf048>
In [67]:
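# 'Greys' runs from white (low values) to black (high values), so the image looks like a negative;
# cmap='gray' gives the usual grayscale rendering
plt.imshow(image2_grayscale, cmap='Greys')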
Out[67]:
<matplotlib.image.AxesImage at 0x23e9ba69f48>
In [69]:
# Steps similar to creating the dataset you have for the coursework
image2_resize = cv2.resize(image2, (48, 48)).reshape((1, -1))

dataset = np.vstack([image_resize, image2_resize])

dataset
Out[69]:
array([[152, 164, 190, ...,  83,  88,  94],
       [ 41, 122,  82, ..., 115, 111, 109]], dtype=uint8)
In [70]:
dataset.shape
Out[70]:
(2, 6912)
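
With more than two images, the same idea extends to a list: flatten every image to one row and stack all rows in a single call. The sketch below uses random arrays as stand-ins for images that were already resized to 48x48; the synthetic data and variable names are assumptions, not the coursework dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for three images already resized to 48x48 with 3 channels
images = [rng.integers(0, 256, size=(48, 48, 3), dtype=np.uint8) for _ in range(3)]

# Flatten each image to one row, then stack the rows into a (n_images, 6912) matrix
rows = [img.reshape((1, -1)) for img in images]
dataset = np.vstack(rows)

print(dataset.shape)
```

Each row of the resulting matrix is one sample and each column one pixel value, which is the layout most machine learning libraries expect.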


Working with Seaborn

In [78]:
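# Count plot: the number of cars for each distinct mpg value (one bar per value);
# set_xticklabels(step=2) only thins out the tick labels, showing every second one
# factorplot was renamed to catplot in seaborn 0.9, so catplot is used here
with sns.axes_style('white'):
    g = sns.catplot(x="mpg", data=mtcars, aspect=3,
                    kind="count", color='steelblue')
    g.set_xticklabels(step=2)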
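
A plot of kind "count" draws one bar per distinct value, with the bar height equal to how often that value occurs. The numbers behind such a plot can be inspected with Pandas alone, as in this minimal sketch; the six-value excerpt and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical excerpt of the mpg column
mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7, 22.8], name="mpg")

# One count per distinct value, sorted by value
counts = mpg.value_counts().sort_index()

print(counts)
```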
In [79]:
# Scatter plot of mpg vs number of cylinders with a fitted linear regression line,
# plus the marginal distribution (histogram and density estimate) of each variable
sns.jointplot(x="mpg", y="cyl", data=mtcars, kind='reg')
Out[79]:
<seaborn.axisgrid.JointGrid at 0x23e9d931f88>


Working with Scikit-Learn

In [81]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
In [82]:
# Create labels for our dataset
labels = np.zeros(shape=(mtcars.shape[0], 1))
labels
labels.shape
Out[82]:
(32, 1)
In [88]:
# We need to add 1 as label for Mercedes as car make and 0 for everything else
# Using a boolean mask
mtcars4['index'].str.contains('Merc')
Out[88]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
Name: index, dtype: bool
In [90]:
# Set the labels according to the mask
labels[mtcars4['index'].str.contains('Merc')] = 1
labels
Out[90]:
array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]])
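
The two steps above (an array of zeros, then assignment through the mask) can also be collapsed into one: casting the boolean mask to integers gives the 0/1 labels directly. The sketch below does this on a made-up list of car names; the excerpt and variable names are assumptions. It produces a 1-D array, which is the shape scikit-learn estimators expect for y; a column vector of shape (n, 1) generally still works but triggers a conversion warning.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of the car names
names = pd.Series(["Mazda RX4", "Merc 240D", "Merc 230", "Valiant"])

# Boolean mask: True where the name contains 'Merc'
is_merc = names.str.contains("Merc")

# Casting the booleans to integers yields the 0/1 labels directly, as a 1-D array
labels = is_merc.astype(int).to_numpy()

print(labels)
```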
In [96]:
# Keep only the numeric feature columns: select_dtypes leaves out the text 'index' column, which holds
# the car names the labels were derived from (the target must be kept separate from the training data)
# The derived 'lp100km' column is dropped as well
mtcars_num = mtcars4.drop(['lp100km'], axis=1).select_dtypes(include=['float64', 'int64'])
mtcars_num
Out[96]:
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
In [98]:
# Instantiate a decision tree class from Scikit-Learn
dec_tree = DecisionTreeClassifier()
In [101]:
# Fit the algorithm to our data
dec_tree.fit(X=mtcars_num, y=labels)
Out[101]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [103]:
# Calculate the accuracy on the training data (the same data the tree was fitted to)
dec_tree.score(X=mtcars_num, y=labels)
Out[103]:
1.0
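
An accuracy of 1.0 here only says that the tree reproduces the labels it was trained on; an unrestricted decision tree can usually do that for any dataset. A more informative estimate comes from scoring on samples the tree has not seen. The sketch below illustrates the idea with scikit-learn's train_test_split on synthetic data; the data, the 25% split and the random seeds are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 4 features, labels that depend noisily on feature 0
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# Hold out 25% of the samples; stratify keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data

print(train_acc, test_acc)
```

With only 32 cars, a single split of mtcars would leave very few test samples, so treat this as a pattern for larger datasets such as the coursework images.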
In [104]:
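# Plot the decision tree
# With filled=True, a more intense node color means a purer node (a more confident prediction)
# The Gini impurity is 0 when a node contains a single class, so that node is not split further
# At each split, the left branch is taken when the condition is True and the right when it is False
# Note: plot_tree's label parameter expects 'all', 'root' or 'none'; to show class names,
# pass class_names=['Not Merc', 'Merc'] instead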
plot_tree(decision_tree=dec_tree, label=['Not Merc', 'Merc'], filled=True, feature_names=mtcars_num.columns.tolist())
Out[104]:
[Text(83.7, 199.32, 'qsec <= 17.35\n0.342\n32\n[25, 7]'),
 Text(41.85, 163.07999999999998, '0.0\n13\n[13, 0]'),
 Text(125.55000000000001, 163.07999999999998, 'disp <= 130.9\n0.465\n19\n[12, 7]'),
 Text(83.7, 126.83999999999999, '0.0\n7\n[7, 0]'),
 Text(167.4, 126.83999999999999, 'drat <= 3.035\n0.486\n12\n[5, 7]'),
 Text(125.55000000000001, 90.6, '0.0\n3\n[3, 0]'),
 Text(209.25, 90.6, 'mpg <= 14.95\n0.346\n9\n[2, 7]'),
 Text(167.4, 54.359999999999985, '0.0\n1\n[1, 0]'),
 Text(251.10000000000002, 54.359999999999985, 'carb <= 1.5\n0.219\n8\n[1, 7]'),
 Text(209.25, 18.119999999999976, '0.0\n1\n[1, 0]'),
 Text(292.95, 18.119999999999976, '0.0\n7\n[0, 7]')]
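
The second number in each node above is its Gini impurity, defined as 1 minus the sum of the squared class proportions in that node. The sketch below recomputes it from the class counts printed in the output; the helper name gini is an assumption introduced for illustration.

```python
def gini(counts):
    # Gini impurity: 1 minus the sum of squared class proportions
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts [Not Merc, Merc] read off the nodes in the output above
for counts in ([25, 7], [12, 7], [5, 7], [0, 7]):
    print(counts, f"{gini(counts):.3f}")
```

For the root node, 1 - (25/32)^2 - (7/32)^2 is approximately 0.342, matching the value in the plot output, and a node holding only one class such as [0, 7] has impurity 0.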


Working with StatsModels

In [106]:
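# Simple regression model: ordinary least squares with hp as the response and mpg as the predictor
# sm.OLS does not add an intercept by default, so this fits a line through the origin
# (hence the "uncentered" R-squared in the summary)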
reg_model = sm.OLS(endog=mtcars['hp'], exog=mtcars['mpg']).fit()
In [107]:
reg_model.summary()
Out[107]:
OLS Regression Results
Dep. Variable:     hp                 R-squared (uncentered):        0.608
Model:             OLS                Adj. R-squared (uncentered):   0.595
Method:            Least Squares      F-statistic:                   47.98
Date:              Tue, 08 Oct 2019   Prob (F-statistic):            9.06e-08
Time:              04:50:21           Log-Likelihood:                -193.14
No. Observations:  32                 AIC:                           388.3
Df Residuals:      31                 BIC:                           389.7
Df Model:          1
Covariance Type:   nonrobust

       coef     std err   t       P>|t|   [0.025   0.975]
mpg    6.0078   0.867     6.927   0.000   4.239    7.777

Omnibus:         1.128   Durbin-Watson:      1.301
Prob(Omnibus):   0.569   Jarque-Bera (JB):   0.980
Skew:            0.209   Prob(JB):           0.613
Kurtosis:        2.251   Cond. No.           1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.